• Our Approach
  • Step 1: Make the plot for “Jessie”
  • Step 2: Make the graphic for all 35 names
  • Word count
  • Standards
  • References

Our Approach

The goal of this project was to mimic the data graphic from the Flowing Data post by finding and then plotting the most unisex names (meaning equally assigned to female and male babies) found in the babynames package. A general strategy we employed was to go through the Jessie start code line by line to figure out what each part was doing, which helped us successfully generalize it later.

To start, we found the root mean squared error (RMSE) for the name “Jessie”. Then, to find the RMSE for all names, we first considered how we could make the data from the babynames package more manageable. First, we filtered by the years in between 1930 and 2012 and then removed rows in which there was a NA value (meaning that the name was not assigned to either males or females.) We also reshaped the data by applying pivot_wider. After altering the data, we created a function called find_rmse that adds columns error and squared error and then summarizes that data to give the mean squared error (mse) and root mean squared error (rmse). We then applied this function to the altered data, called all_babies, that we grouped by name.

We tried various things when filtering to get the top 35 most unisex names: filtering out names with NA in the M or F column, finding the total occurrences of each name and considering the most popular names, and finding the total years in which each name appeared and including names that occurred in at least 70 years. We were not entirely successful at matching the original data graphic.

To gather the data for the time series, we filtered the original babynames data frame to include only the years of interest. We rearranged and mutated the data based on the “Jessie” plot code. To limit the names included to only the top 35 most unisex ones, we did an inner join between our modified babynames data frame and the data frame we created earlier with the 35 most unisex names.

To draw the points representing the most unisex years, we started with the code to draw the point for the name “Jessie” and wrote a function to generalize the process. Then we used map_dfr() to iterate those steps over the list of the top 35 most unisex names.

To create the annotations for the exhibit, we used the tribble() or transposed tibble command to manually create the descriptions for each notable name trend. This enabled us to construct a new data frame in which each row corresponds to a single segment, such as the year and the composition of boys to girls ratio.

Finally, to draw the plot, we used a combination of the line, area, point, path, and text geoms. We mimicked a lot of elements from the sample Jessie plot, including the fill and scale of the y axis, but changed other elements including but not limited to adding a facet wrap based on name and adding annotations with segments to the plots of certain names.

Copy the Master Assignment

Step 1: Make the plot for “Jessie”

Step 1A: Gather the data for “Jessie”

jessie <- babynames %>%
  filter(
    name == "Jessie", 
    year >= 1930 & year < 2012
  ) %>%
  select(-prop) %>%
  pivot_wider(names_from = sex, values_from = n) %>%
  mutate(pct_girls = F / (F + M))
jessie
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
1930Jessie219613300.6228020
1931Jessie193012670.6036910
1932Jessie189512820.5964747
1933Jessie180710770.6265603
1934Jessie179310910.6217060
1935Jessie161811030.5946343
1936Jessie158610130.6102347
1937Jessie155210400.5987654
1938Jessie14759710.6030253
1939Jessie139610580.5688672

Step 1B: Compute the “most unisex year”

jessie_unisex_year <- jessie %>%
  mutate(distance = abs(pct_girls - 0.5)) %>%
  arrange(distance) %>%
  head(1)
jessie_unisex_year
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
distance
<dbl>
1949Jessie103110230.50194740.00194742

Step 1C: Add the annotations for “Jessie”

jessie_context <- tribble(
  ~year_label, ~vpos, ~hjust, ~name, ~text,
  1934, 0.35, "left", "Jessie", "Most\nunisex year"
)

jessie_segments <- tribble(
  ~year, ~pct_girls, ~name,
  1940, 0.43, "Jessie",
  1940, 0.5, "Jessie",
  1949, 0.4956897, "Jessie"
)

jessie_labels <- tribble(
  ~year, ~name, ~pct_girls, ~label,
  1998, "Jessie", 0.8, "BOYS",
  1998, "Jessie", 0.2, "GIRLS"
)

Step 1D: Draw the plot for “Jessie”

ggplot(jessie, aes(x = year, y = pct_girls)) +
  geom_line() +
  geom_area(fill = "#eaac9e") +
  geom_point(data = jessie_unisex_year, fill = "white", pch = 21, size = 3) +
  geom_path(data = jessie_segments) +
  geom_text(
    data = jessie_labels, 
    aes(label = label), 
    color = "white"
  ) +
  geom_text(
    data = jessie_context, family = "Century Gothic",
    aes(x = year_label, y = vpos, label = text, hjust = hjust), vjust = "top"
  ) +
  scale_y_continuous(NULL, 
    limits = c(0, 1),
    breaks = c(0, 0.5, 1),
    labels = scales::percent, 
    expand = c(0,0)
  ) +
  scale_x_continuous(breaks = c(1940, 1960, 1980, 2000), 
                     labels = c("1940", "'60", "'80", "2000"), 
                     expand = c(0,0), 
                     NULL) +
  scale_fill_manual(values = c("#eaac9e", "black")) +
  theme(
    panel.background = element_rect(fill = "#92bdd3"),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
#    text = element_text(family = "Century Gothic"),
    strip.background = element_blank(),
    strip.text = element_text(hjust = 0, face = "bold", size = 14)
  ) +
  guides(fill = FALSE) +
  labs(
    title = "1. Jessie", 
    caption = "Source: Social Security Administration | By http://flowingdata.com"
  )

Step 2: Make the graphic for all 35 names

Make the full data graphic with the 35 most gender-neutral names:

Or at least, make an attempt that’s as good as mine:

This bit of code will create a data frame with the 35 names as ranked by FlowingData.com. You can use this to check your work, but note that to meet the standard for computing the names, you need to discover these names algorithmically.

fd_names <- c(
  "Jessie", "Marion", "Jackie", "Alva", "Ollie",
  "Jody", "Cleo", "Kerry", "Frankie", "Guadalupe",
  "Carey", "Tommie", "Angel", "Hollis", "Sammie",
  "Jamie", "Kris", "Robbie", "Tracy", "Merrill",
  "Noel", "Rene", "Johnnie", "Ariel", "Jan",
  "Devon", "Cruz", "Michel", "Gale", "Robin",
  "Dorian", "Casey", "Dana", "Kim", "Shannon"
) %>%
  enframe(name = "fd_rank", value = "name")

Step 2A: Compute the RMSE for Jessie

jessie %>%
  mutate(
    error = pct_girls - 0.5,
    squared_error = error^2
  ) %>%
  summarize(
    mse = mean(squared_error),
    rmse = sqrt(mse)
  )
ABCDEFGHIJ0123456789
mse
<dbl>
rmse
<dbl>
0.0099807330.09990362

Step 2B: Compute the RMSE for all names

#collect all baby data, filter and reshape it
all_babies <- babynames %>%
  filter( year >= 1930 & year < 2012) %>%
  select(-prop) %>%
  pivot_wider(names_from = sex, values_from = n) %>%
   mutate(
  pct_girls = F / (F + M))%>%
  filter(!is.na(F) & !is.na(M)) 
find_rmse <- function(x) {
  x %>%
    mutate(
           error = pct_girls - 0.5,
           squared_error = error^2
    ) %>%
    summarize(
      mse = mean(squared_error),
      rmse = sqrt(mse)
    )
}
all_babies %>%
  group_by(name)%>%
  group_modify(~find_rmse(.x))

Step 2C: Rank and filter the list of names

#finds 1000 most popular names
popular_names <- all_babies %>%
  group_by(name)%>%
  summarize(
    total_years = n(),
    total_occurences = sum(F+M)
  ) %>%
  filter(total_years >= 70 & name != "Unknown")%>%
  arrange(desc(total_occurences))%>%
 head(1000)
popular_names
ABCDEFGHIJ0123456789
name
<chr>
total_years
<int>
total_occurences
<int>
Michael824202264
James824125512
Robert823890601
John823872824
David823397125
William822974869
Mary782540721
Richard812229191
Joseph822028334
Christopher741962547
pop_rmse <- all_babies%>%
  inner_join(popular_names, by = "name")
pop_rmse
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
total_years
<int>
total_occurences
<int>
1930Mary641463400.994727538782540721
1930Patricia15752520.996709694731480993
1930Joan15480680.99562644782430557
1930Jean119842880.97653194382320636
1930Elizabeth10995480.995653355821219786
1930Frances106461270.98821126979285830
1930Evelyn9536500.99478406074253394
1930Anna9079520.99430511481459306
1930Nancy9069250.99725093573916855
1930Catherine6298320.99494470871426077
most_unisex_names <- pop_rmse %>%
  group_by(name)%>%
  group_modify(~find_rmse(.x))%>%
  arrange((rmse))%>%
  head(35)
most_unisex_names
ABCDEFGHIJ0123456789
name
<chr>
mse
<dbl>
rmse
<dbl>
Jessie0.0099807330.09990362
Alva0.0122058350.11048002
Marion0.0123768310.11125121
Carlin0.0136782730.11695415
Natividad0.0136965640.11703232
Michal0.0149600150.12231114
Jackie0.0168389700.12976506
Arie0.0188010230.13711682
Trinidad0.0200063730.14144388
Lorenza0.0208817480.14450518

Step 2D: Gather the data you need to draw the time series

data <- babynames %>%
  filter(
    year >= 1930 & year < 2012
  ) %>%
  select(-prop) %>%
  pivot_wider(names_from = sex, values_from = n) %>%
  mutate(pct_girls = F / (F + M)) %>%
  inner_join(most_unisex_names, by = c("name" = "name"))

Step 2E: Gather the data you need to draw the points

most_unisex_yr <- function(name_arg) {
  all_babies %>%
    filter(name == name_arg) %>%
    mutate(distance = abs(pct_girls - 0.5)) %>%
    arrange(distance) %>%
    head(1)
}

names_list <- most_unisex_names %>% 
  select(-mse, -rmse) %>%
  deframe()
  
unisex_years <- map_dfr(names_list, most_unisex_yr) 

unisex_years
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
distance
<dbl>
1949Jessie103110230.50194740.001947420
1972Alva29290.50000000.000000000
1977Marion2292280.50109410.001094092
1945Carlin990.50000000.000000000
1987Natividad15150.50000000.000000000
1990Michal69690.50000000.000000000
2006Jackie1181190.49789030.002109705
1960Arie11110.50000000.000000000
1934Trinidad43430.50000000.000000000
1983Lorenza22220.50000000.000000000

Step 2F: Polish the data

all_babies <- all_babies %>% filter(name != "Unknown")

most_unisex_yr <- function(name_arg) {
  all_babies %>%
    filter(name == name_arg) %>%
    mutate(distance = abs(pct_girls - 0.5)) %>%
    arrange(distance) %>%
    head(1)
}

names_list <- most_unisex_names %>% 
  select(-mse, -rmse) %>%
  deframe()
  
unisex_years <- map_dfr(names_list, most_unisex_yr) 

unisex_years
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
distance
<dbl>
1949Jessie103110230.50194740.001947420
1972Alva29290.50000000.000000000
1977Marion2292280.50109410.001094092
1945Carlin990.50000000.000000000
1987Natividad15150.50000000.000000000
1990Michal69690.50000000.000000000
2006Jackie1181190.49789030.002109705
1960Arie11110.50000000.000000000
1934Trinidad43430.50000000.000000000
1983Lorenza22220.50000000.000000000

Step 2G: Create the annotations

map_dfr(c("Jessie", "Marion", "Jackie", "Ariel", "Jamie"), most_unisex_yr)
ABCDEFGHIJ0123456789
year
<dbl>
name
<chr>
F
<int>
M
<int>
pct_girls
<dbl>
distance
<dbl>
1949Jessie103110230.50194740.001947420
1977Marion2292280.50109410.001094092
2006Jackie1181190.49789030.002109705
1930Ariel880.50000000.000000000
1936Jamie49490.50000000.000000000
general_context <- tribble(
  ~year_label, ~vpos, ~hjust, ~name, ~text,
  1934, 0.35, "left", "Jessie", "Most\nunisex year",
  1977, 0.35, "right", "Marion", "Marion Jones wins\ngold in Olympics", 
  2006, 0.35, "top", "Jackie", "Jackie Robinson to\nmajor league", 
  1930, 0.35, "right", "Ariel", "The Little Mermaid\nsways Ariel towards girls",
  1936, 0.35, "top", "Jamie", "Jamie Hunter Cartwright\nappears on Bonanza"
)

general_segments <- tribble(
  ~year, ~pct_girls, ~name, 
  1940, 0.43, "Jessie",
  1940, 0.5, "Jessie",
  1949, 0.4956897, "Jessie",
  1940, 0.23, "Marion",
  1940, 0.5, "Marion", 
  1977, 0.5, "Marion", 
  1980, 0.33, "Jackie",
  1980, 0.498, "Jackie",
  2006, 0.498, "Jackie",
  1923, 0.23, "Ariel",
  1923, 0.498, "Ariel",
  1930, 0.498, "Ariel",
  1928, 0.23, "Jamie", 
  1936, 0.5, "Jamie",
  1936, 0.5, "Jamie"
)

Step 2H: Order the facets

ranked_names <- most_unisex_names %>%
  mutate(
        fct_rmse = factor(rmse),
        name_rank= dense_rank(fct_rmse),
        name_label = paste(name_rank, name, sep = ".")
    )
ranked_names
ABCDEFGHIJ0123456789
name
<chr>
mse
<dbl>
rmse
<dbl>
fct_rmse
<fct>
name_rank
<int>
name_label
<chr>
Jessie0.0099807330.099903620.09990361703494711.Jessie
Alva0.0122058350.110480020.1104800226723511.Alva
Marion0.0123768310.111251210.11125120508700611.Marion
Carlin0.0136782730.116954150.11695414990702111.Carlin
Natividad0.0136965640.117032320.11703232140240711.Natividad
Michal0.0149600150.122311140.12231113952507511.Michal
Jackie0.0168389700.129765060.12976505529480111.Jackie
Arie0.0188010230.137116820.13711682192578811.Arie
Trinidad0.0200063730.141443880.14144388475896611.Trinidad
Lorenza0.0208817480.144505180.14450518351264511.Lorenza

Step 2I: Draw the plot

ggplot(data, aes(x = year, y = pct_girls)) +
  geom_line() +
  geom_area(fill = "#eaac9e") +
  facet_wrap(~name, scales='free_x', ncol = 7) +
  geom_point(data = unisex_years, fill = "white", pch = 21, size = 2.8) +
  geom_text(
    data = jessie_labels, 
    aes(label = label), 
    color = "white"
  ) +
  geom_text(
  data = general_context, family = "Century Gothic",
    aes(x = year_label, y = vpos, label = text, hjust = hjust), 
    vjust = "top",size=2
  ) +
  scale_y_continuous(NULL, 
    limits = c(0, 1),
    breaks = c(0, 0.5, 1),
    labels = scales::percent,
    expand = c(0,0)
  ) +
    geom_path(data = general_segments, aes(x = year, y = pct_girls)) +
  scale_x_continuous(breaks = c(1940, 1960, 1980, 2000), 
                     labels = c("1940", "'60", "'80", "2000"), 
                     expand = c(0,0),
                     NULL
                     ) +
  scale_fill_manual(values = c("#eaac9e", "black")) +
  theme(
    panel.background = element_rect(fill = "#92bdd3"),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(hjust = 0, face = "bold", size = 14)
  ) +
  guides(fill = FALSE) +
  labs(
    caption = "Source: Social Security Administration | By http://flowingdata.com"
  )

Word count

Method koRpus stringi
Word count 822 811
Character count 4712 4711
Sentence count 51 Not available
Reading time 4.1 minutes 4.1 minutes

Standards

In this assignment, we attempted the following standards:

  • : We mastered the Wrangling standard because we utilized functions such as group_by() and group_modify(), as well as mutate to optimize the code for readability and performance.
  • : We mastered the Relational standard because we utilized the inner_join function to include potentially missing data and merge various data frames.
  • : We mastered the Reshape standard because we transformed the various data frames with pivot functions for the lists.
  • : We mastered the Aesthetics standard because we customized and utilized a wide range of available geoms and color palettes, along with specific labeling of the graphic.
  • : We mastered the R Markdown standard because we employed formatting within the Markdown file to include a variety of code chunks, text, and graphics.